BUG/PERF: Series.combine_first converting int64 to float64 #51777


Merged

Conversation

lukemanley
Member

@lukemanley lukemanley added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Mar 4, 2023
@jessestone7

If the solution to this issue is to convert floats back to ints after they get converted to floats, that will not work correctly for some ints. For example:

>>> s1 = pd.Series([1666880195890293744, 1666880195890293837])
>>>
>>> s1
0    1666880195890293744
1    1666880195890293837
dtype: int64
>>>
>>> s1.astype(float).astype(int)
0    1666880195890293760
1    1666880195890293760
dtype: int64

It would be good to include big ints like these in the tests.
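The precision loss above is a property of the float64 format: its 53-bit significand cannot represent every int64 exactly, so distinct large integers can collapse to the same value on a round trip. A minimal NumPy sketch of the boundary (values here are illustrative):

```python
import numpy as np

# int64 values above 2**53 are not all exactly representable as float64:
# the 53-bit significand makes adjacent large ints round to the same float.
big = np.array([1666880195890293744, 1666880195890293837], dtype=np.int64)
roundtripped = big.astype(np.float64).astype(np.int64)
assert roundtripped[0] == roundtripped[1]  # both collapse to one value

# At or below 2**53 the int -> float -> int round trip is exact.
small = np.array([2**53 - 1, 12345], dtype=np.int64)
assert (small.astype(np.float64).astype(np.int64) == small).all()
```

This is also why `np.iinfo(np.int64).min` and `max` round-trip cleanly, as noted later in the thread: both happen to be exactly representable as float64.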

@@ -3272,7 +3274,13 @@ def combine_first(self, other) -> Series:
         if this.dtype.kind == "M" and other.dtype.kind != "M":
             other = to_datetime(other)
 
-        return this.where(notna(this), other)
+        combined = this.where(notna(this), other)
Member
Not sure whether this will break anything, but for the Series case there might be a proper fix: you could set a compatible fill_value for the reindex ops. The only disadvantage is that you'll have to compute notna(this) somehow.

@lukemanley
Member Author

lukemanley commented Mar 4, 2023

If the solution to this issue is to convert floats back to ints when they get converted to floats, that will not work correctly for some ints

Great catch, thanks! I've updated the PR to handle this and updated the test.

This also provides a nice perf improvement:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 59.4 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)     -> main
# 1.37 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR

And I'll note the perf improvement is not specific to the integer case; here is float64:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randn(N), dtype="float64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randn(N), dtype="float64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 48.4 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)      -> main
# 1.64 ms ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR

@lukemanley lukemanley added the Performance Memory or execution speed performance label Mar 4, 2023
@lukemanley lukemanley changed the title BUG: Series.combine_first converting int64 to float64 BUG/PERF: Series.combine_first converting int64 to float64 Mar 4, 2023
-        return this.where(notna(this), other)
+        combined = concat([this, other]).reindex(new_index, copy=False)
+        combined.name = self.name
+        return combined
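The concat-and-reindex approach in the diff above can be sketched in user code. This is a simplified illustration (all names and data are made up, and it skips the null-handling in `self` that the full patch adds): instead of writing NaN placeholders via `where`, it takes `this` as-is, appends only the rows of `other` whose labels `this` lacks, and reindexes to the combined index, so no dtype-changing NaN fill ever happens.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"], dtype="int64")
s2 = pd.Series([3, 4, 5], index=["b", "c", "d"], dtype="int64")

new_index = s1.index.union(s2.index)

# Rows of s2 whose labels are absent from s1; s1 wins on shared labels,
# matching combine_first semantics (for non-null s1 values).
other = s2.reindex(s2.index.difference(s1.index))
combined = pd.concat([s1, other]).reindex(new_index)

assert combined.dtype == "int64"  # no float64 upcast
assert combined.to_dict() == {"a": 1, "b": 2, "c": 4, "d": 5}
```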
Member

This needs a finalise call with self, otherwise we will lose metadata.

This might change result ordering? Can we do a reindex at the end?

Member Author

Updated, thanks

this = self
null_mask = isna(this)
if null_mask.any():
drop = this.index[null_mask].intersection(other.index[notna(other)])
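A small illustration of what the `drop` computation in the snippet above collects (example data is made up): the labels where `this` is null but `other` has a non-null value. Dropping those rows from `this` before concatenating lets `other` supply them, which is exactly combine_first's "fill nulls from other" rule, without ever writing NaN into an integer array.

```python
import pandas as pd
import numpy as np

this = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
other = pd.Series([10.0, 20.0], index=["b", "c"])

null_mask = this.isna()
# Labels that are null in `this` AND filled by a non-null value in `other`:
drop = this.index[null_mask].intersection(other.index[other.notna()])

assert list(drop) == ["b"]  # "b" is null in `this` and present in `other`
```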
Member

Can you add a quick comment on what this is doing? It took me a while to figure out, and I don't want to do it again in 6 months' time :)

Otherwise lgtm

Member Author

I've simplified this logic a bit; it should be easier to follow now.

@jessestone7

It looks like you are using these big integers in the tests to test for loss of precision:

>>> s1 = Series([np.iinfo(np.int64).min, np.iinfo(np.int64).max])

But note that there is no loss in going from int to float and back to int for these particular numbers:

>>> s1
0   -9223372036854775808
1    9223372036854775807
dtype: int64

>>> s1.astype(float).astype(int)
0   -9223372036854775808
1    9223372036854775807
dtype: int64

It would be good to use ints in the tests where loss of precision actually could happen, e.g.:

>>> s1 = pd.Series([1666880195890293744, 1666880195890293837])

>>> s1
0    1666880195890293744
1    1666880195890293837
dtype: int64

>>> s1.astype(float).astype(int)
0    1666880195890293760
1    1666880195890293760
dtype: int64

@lukemanley
Member Author

Updated, thanks @jessestone7

@mroeschke mroeschke merged commit 78947dd into pandas-dev:main Mar 10, 2023
@mroeschke
Member

Thanks again @lukemanley

@lukemanley lukemanley deleted the series-combine-first-preserve-dtype branch March 17, 2023 22:04
@jessestone7

I just tried this on pandas version 2.2.2 and I see that there is still a loss of precision:

>>> a = pd.Series([1666880195890293744, 5]).to_frame()
>>> b = pd.Series([6, 7, 8]).to_frame()
>>> a
                     0
0  1666880195890293744
1                    5
>>> b
   0
0  6
1  7
2  8
>>> a.combine_first(b)
                     0
0  1666880195890293760
1                    5
2                    8

Successfully merging this pull request may close these issues.

BUG: preserve the dtype on Series.combine_first